Abstract

We consider a dynamic pricing problem under unknown demand models. In this problem a seller 
offers prices to a stream of customers and observes either success or failure in each sale attempt. The 
underlying demand model is unknown to the seller and can take one of N possible forms. In this paper, 
we show that this problem can be formulated as a multi-armed bandit with dependent arms. We propose 
a dynamic pricing policy based on the likelihood ratio test. We show that the proposed policy achieves 
complete learning, i.e., it offers a bounded regret where regret is defined as the revenue loss with respect 
to the case with a known demand model. This is in sharp contrast with the logarithmically growing
regret of multi-armed bandits with independent arms.



Index Terms 

Dynamic Pricing, multi-armed bandit, maximum likelihood detection. 

I. Introduction 

The sequential pricing of a certain good under an unknown demand model is a fundamental
management science problem with applications in financial services, electricity markets,
online posted-price auctions of digital goods, and radio spectrum management. In this
problem, a seller offers a sequence of prices for the good to a stream of potential customers and
observes either success or failure in each sale attempt. The customers are assumed to be
identical and are described by a demand model $\rho(p)$, which prescribes the probability
of a successful sale at the offered price $p$. The demand model is assumed to be unknown to
the seller and needs to be learned online through sequential observations. Unlike the conventional
inventory-constrained setting in operations research and management science, we assume that there is
an unlimited supply of the good (consider, for example, an online posted-price auction of a
digital good, whose supply is effectively infinite). The objective is to maximize the total revenue
over a horizon of length T by choosing sequentially the price at each time based on the sale 
history. When choosing the price at each step, the seller confronts a tradeoff between exploring 
the demand model (learning) and exploiting the price with the best selling history (earning). As 
the seller gains information about the unknown demand model from the past selling history, the 
seller's pricing strategy can improve over time. 

A. Dynamic Pricing as a Multi-Armed Bandit

Dynamic pricing can be formulated as a special multi-armed bandit (MAB) problem, and the
connection was explored as early as 1974 by Rothschild in [1]. A mathematical abstraction of
MAB in its basic form involves N independent arms and a single player. Each arm, when played,
offers an independent and identically distributed (i.i.d.) random reward drawn from a distribution
with unknown mean $\theta_i$. At each time, the player chooses one arm to play, aiming to maximize the
expected total reward obtained over a horizon of length T. Depending on whether the unknown
mean $\theta_i$ of each arm is treated as a random variable with a known prior distribution or as a
deterministic quantity, MAB problems can be formulated and studied within either a Bayesian
or a non-Bayesian framework.

Within the Bayesian framework, system unknown parameters are random variables, and the 
design objective is policies with good average performance (averaged over the prior distributions 
of the unknowns). Often, the performance of a policy is measured by the total discounted reward 
or the average reward over an infinite horizon. By treating the player's a posteriori probabilistic 
knowledge (updated from the a priori distribution using past observations) on the unknown 
parameters as the system state, Bellman in 1956 abstracted and generalized the Bayesian MAB
to a special class of Markov decision processes (MDP) [2]. In the 1970s, Gittins showed that the
optimal policy of MAB has a simple index structure, the so-called Gittins index policy [3]. This
leads to linear (in the number N of arms) complexity in finding the optimal policy, in contrast
to the exponential complexity one would face if the problem were solved as a general
MDP.

Within the non-Bayesian framework, the unknowns in the reward models are treated as 
deterministic quantities and the design objective is universally (over all possible values of 
the unknowns) good policies. A commonly used performance measure is the so-called regret 
(a.k.a. the cost of learning) defined as the expected total reward loss with respect to the ideal 
scenario of known reward models (under which the arm with the largest reward mean is always 
played). To minimize the regret, the player needs to identify the best arm without engaging other 
inferior arms too often. In 1985, Lai and Robbins [4] showed that the minimum regret grows
logarithmically with T and constructed a policy that achieves the minimum regret for certain
reward distributions.

The connection between MAB and dynamic pricing is now readily seen: each potential price $p$
is an arm with an unknown reward mean $p\rho(p)$ (the expected revenue at price $p$). When the seller
can choose any price within an interval, the problem becomes a continuum-armed bandit [5]-[9].
Kleinberg and Leighton in [9] specifically consider an online posted-price auction under
an unknown demand model, which is a special case of the continuum-armed bandit problem.
In [1], Rothschild considered the case where the seller can choose prices from a finite set
and formulated the problem as a classic MAB within the Bayesian framework, assuming prior
probabilistic knowledge of the demand model. His focus was on whether a seller
who follows an optimal policy (in terms of total discounted revenue over an infinite horizon) will
eventually obtain complete information about the underlying demand model and thus settle at the
optimal price. It was shown in [1] that the answer is in general negative. In light of the theories
on MAB developed since 1974, this conclusion is, perhaps, no longer surprising. The optimal
policy of a Bayesian MAB will always settle at a single arm (after a finite number of visits to
other arms), which is not necessarily the best one. Following Rothschild, McLennan showed that
incomplete learning can occur even when the seller can choose among a continuum of prices [10].
McLennan adopted a simple binary demand model: it is known that one of two possible demand
models $\rho_0(p)$ and $\rho_1(p)$ pertains, with prior probability $1 - q_0$ and $q_0$, respectively.
Even though the optimal policy offers the best performance averaged over all possible demand
environments under the known prior distribution, the fact that incomplete learning occurs with
positive probability can be unsettling: a seller may see only one realization of the
demand model and thus care only about the revenue under this specific realization rather than
the average over all realizations it might have seen. In this case, a policy that guarantees
complete learning under every possible demand model may be desirable, even though it may
not offer the best average performance.

This issue was addressed by Harrison et al. in [11], where they adopted the same binary demand
model considered by McLennan [10] but focused on achieving complete learning under each
realization of the demand model rather than the best average performance. The myopic Bayesian
policy (MBP) and its modified versions were studied. It was shown that although MBP can lead 
to incomplete learning, a modified version of MBP will always learn the underlying demand 
model completely and settle at the optimal price. If we borrow the performance measure of 
regret that is often used within the non-Bayesian framework, complete learning implies a finite 
regret that does not grow unboundedly with the horizon length T. 

B. Main Results 

In this paper, we provide a different approach to the problem considered by McLennan in [10]
and Harrison et al. in [11]. In particular, we adopt the non-Bayesian framework, which does not
assume any probabilistic prior knowledge on which demand model may pertain. We show that
complete learning (i.e., finite regret) can be achieved without this prior knowledge. Furthermore,
in contrast to the modified MBP proposed in [11], our proposed policy achieves finite regret
without complete knowledge of the demand curves $\{\rho_0(p), \rho_1(p)\}$. The only knowledge required
by our proposed policy is the optimal prices $\{p_0^*, p_1^*\}$ under each demand model and the values
$\rho_i(p_j^*)$ ($i, j \in \{0, 1\}$) of the demand models at these two prices. Our results also generalize to
the case with an arbitrary number N of potential demand curves.

Our approach is based on a multi-armed bandit formulation of the problem within the
non-Bayesian framework. In our formulation, each arm $j$ ($1 \le j \le N$) represents the optimal price $p_j^*$
under demand model $\rho_j(p)$. Since all arms share the same underlying demand model, arms are
correlated. In other words, observations from one arm also provide information on the quality
of other arms by revealing the underlying demand model. Recognizing the detection component
of this bandit problem with dependent arms, we propose a policy based on the likelihood ratio
test (LRT) and show that it has finite regret. Compared to [11], this result on complete learning
is established in a considerably simpler manner. Furthermore, simulation examples demonstrate
that the proposed LRT policy can outperform the modified MBP policy (CMBP) proposed in
[11].

By introducing exploration prices (the prices with the largest Chernoff distance between the 
two demand models that are currently detected as most likely), we show that a variation of the 
LRT policy can improve the rate of learning the underlying demand model and reduce regret. 
This enhancement, however, requires more knowledge on the demand curves than in the LRT 
policy. 

In the context of multi-armed bandits, this result provides an interesting case where dependencies
across arms can be exploited to achieve finite regret that does not grow unboundedly with
the horizon length T. This is in sharp contrast to a naive approach that ignores arm dependencies
and directly applies the classic MAB policies. The latter would lead to a regret that grows
logarithmically with T.

C. Related Work 

Within the Bayesian framework, following Rothschild [1], Easley and Kiefer [12] and
Aghion et al. [13] also studied the achievability of complete learning. Aviv and Pazgal in [14]
considered parametric uncertainty in the demand model where a prior distribution of the unknown
parameter is assumed known. They formulated the dynamic pricing problem as a partially
observable Markov decision process (POMDP) and developed upper bounds on the performance
of the optimal policy. They also proposed an active-learning heuristic policy with near-optimal
performance. Farias and Van Roy in [15] considered a similar problem under inventory constraints and
Poisson arrivals of customers and developed near-optimal heuristic policies. Keller and Rady
in [16] considered the case under infinite horizon with discounted reward where the demand
model may change over time. In order to learn the underlying demand model, they studied 
two qualitatively different heuristics based on exploration (deviating from myopic policy) and 
exploitation (close to myopic policy). 

Within the non-Bayesian framework, besides regret, another metric mainly considered in the
analysis of auction mechanisms [17]-[20] is the competitive ratio, defined as $R(S^{\mathrm{opt}})/R(S)$, where $S$ is
the seller's strategy, $S^{\mathrm{opt}}$ is the optimal fixed-price strategy under the known demand model, and
$R(\cdot)$ is the expected revenue function. Blum et al. showed in [18] that there are randomized
pricing policies achieving competitive ratio $1 + \epsilon$ for any $\epsilon > 0$. This result indicates that $R(S)$
can converge to $R(S^{\mathrm{opt}})$, but it does not reveal the rate of convergence. In this paper we analyze
the additive regret $R(S^{\mathrm{opt}}) - R(S)$ (a stricter metric than the competitive ratio) and focus on the
growth of regret with the time horizon length.

As mentioned earlier, Kleinberg and Leighton in [9] studied dynamic pricing (online posted 
price auctions) as a special continuum-armed bandit problem. In particular, they analyzed the 
regret for three different cases of demand models. In the first scenario, the customers' valuations
are all equal to an unknown single price in $[0, 1]$, and a customer accepts the offered price only if
it is below this valuation. Kleinberg and Leighton showed that there is a deterministic pricing
strategy achieving regret $O(\log\log T)$ and that no pricing strategy can achieve regret $o(\log\log T)$,
where T is the horizon length (or equivalently, the number of customers). In the second scenario,
the customers' valuations are independent random samples from a fixed unknown probability
distribution on $[0, 1]$. This model implies that the demand curve is the complementary cumulative
distribution function (CDF) of a certain random variable, which is more restrictive than a general
demand model. For this scenario they showed that there is a pricing strategy achieving regret
$O(\sqrt{T \log T})$. The last scenario considered in [9] makes no stochastic assumptions about the
demand model. It is shown that there is a pricing strategy achieving regret $O(T^{2/3} (\log T)^{1/3})$ and
that no pricing strategy can achieve regret $o(T^{2/3})$. Besbes and Zeevi in [21] considered a dynamic
pricing problem under both parametric and non-parametric uncertainty models. They obtained
lower bounds on the regret and developed algorithms that achieve a regret close to the lower
bound.

II. Problem Statement 

Consider a seller who offers a particular product to customers who arrive sequentially. For
each customer, the seller proposes a price $p$ from the interval $[l, u]$; the customer accepts the price
$p$ with probability $\rho(p)$. We call the function $\rho(\cdot)$ the demand model.

Before the first customer arrives, nature chooses a demand model from the set $\{\rho_i(\cdot)\}_{i=0}^{N-1}$ as
the ambient demand model. This choice is unknown to the seller, but the seller knows the
set of potential demand models $\{\rho_i(\cdot)\}_{i=0}^{N-1}$ (as shown later, this assumption can be relaxed
in the proposed policy). If price $p_t$ is offered to the $t$-th customer, the seller observes a binary
random variable $o_t$, where $o_t = 1$ (success) happens with probability $\rho(p_t)$ and $o_t = 0$ (failure)
otherwise. The expected revenue at time $t$ if the underlying demand model is $\rho_i(\cdot)$ is

$$ r_i(p_t) = p_t \rho_i(p_t). $$

The seller aims to maximize the total revenue by offering prices sequentially. Under a horizon
of length T, the pricing policy is defined formally as the sequence $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_T)$, where
$\alpha_t$ is a map from the past observations $\cup_{j=1}^{t-1} (p_j, o_j)$ to a choice of price in $[l, u]$. When there is no
confusion, $\alpha_t$ is also used to denote the action taken at time $t$.

The expected total revenue if the underlying demand model is $\rho_i(\cdot)$ can be written as

$$ R_i^{\alpha}(T) = E_{\alpha}\Big[ \sum_{t=1}^{T} r_i(p_t) \Big]. $$

The regret, defined as the expected revenue loss with respect to a seller who knows the underlying
demand model, is given by

$$ \Delta_i^{\alpha}(T) = T\, r_i(p_i^*) - R_i^{\alpha}(T), $$

where $p_i^*$ is the revenue-maximizing price under $\rho_i(\cdot)$.
It is easy to see that maximizing $R_i^{\alpha}(T)$ is equivalent to minimizing $\Delta_i^{\alpha}(T)$.
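
As a concrete illustration of these definitions, the following minimal Python sketch evaluates the expected revenue, the optimal price, and the regret of a seller who always offers one fixed price. It assumes the linear demand models $1.4 - 0.9p$ and $0.8 - 0.3p$ on $[0.5, 1.5]$ that appear in the simulation section; the helper names (`expected_revenue`, `optimal_price`, `fixed_price_regret`) and the grid search are illustrative assumptions, not part of the paper.

```python
def rho0(p):
    """Demand model rho_0(p) = 1.4 - 0.9 p (simulation Case 1)."""
    return 1.4 - 0.9 * p


def rho1(p):
    """Demand model rho_1(p) = 0.8 - 0.3 p (simulation Case 1)."""
    return 0.8 - 0.3 * p


def expected_revenue(rho, p):
    """r_i(p) = p * rho_i(p)."""
    return p * rho(p)


def optimal_price(rho, low=0.5, high=1.5, steps=10000):
    """Grid-search approximation of argmax_p r_i(p) over [low, high]."""
    grid = [low + (high - low) * n / steps for n in range(steps + 1)]
    return max(grid, key=lambda p: expected_revenue(rho, p))


def fixed_price_regret(rho, price, horizon):
    """Regret of always offering `price` when the true model is `rho`."""
    best = expected_revenue(rho, optimal_price(rho))
    return horizon * (best - expected_revenue(rho, price))


p0_star = optimal_price(rho0)  # close to 7/9
p1_star = optimal_price(rho1)  # close to 4/3
# Always offering p0_star while rho1 is the true model loses revenue at every step.
loss = fixed_price_regret(rho1, p0_star, horizon=1000)
```

In this sketch the regret of a fixed wrong price grows linearly with the horizon, which is the behavior a learning policy is meant to avoid.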

III. The Bayesian Approach 

In this section we give a brief review of the work by Harrison et al. in [11], developed within
the Bayesian framework for the special case of N = 2. In the Bayesian approach, the seller
is equipped with prior knowledge of the underlying demand model: the seller knows that the
underlying demand model is $\rho_1(\cdot)$ with probability $q_0$. The objective of the seller is to maximize
the expected average revenue

$$ \max_{\alpha} R^{\alpha}(T) = q_0 R_1^{\alpha}(T) + (1 - q_0) R_0^{\alpha}(T). \tag{1} $$

This is equivalent to minimizing the expected regret,

$$ \min_{\alpha} \Delta^{\alpha}(T) = q_0 \Delta_1^{\alpha}(T) + (1 - q_0) \Delta_0^{\alpha}(T). \tag{2} $$

For a finite time horizon T, this problem can be formulated as a partially observable Markov
decision process (POMDP).

State space: $S = \{0, 1\}$, representing demand model $\rho_0(\cdot)$ or $\rho_1(\cdot)$, respectively.
Action space: $A = [l, u]$, representing all possible prices $p_t \in [l, u]$.

Observation space: $O = \{0, 1\}$, where 1 represents success and 0 represents failure in a sale.
Transition probability: $P^a_{ji} = \Pr\{S_{t+1} = i \mid S_t = j, A_t = a\}$.
In our problem the demand model does not change over time, so

$$ P^a_{01} = P^a_{10} = 0. $$

Observation model: $h^a_{j\theta} = \Pr\{O_t = \theta \mid S_{t+1} = j, A_t = a\}$. We have

$$ h^a_{00} = 1 - \rho_0(a), \quad h^a_{01} = \rho_0(a), \quad h^a_{10} = 1 - \rho_1(a), \quad h^a_{11} = \rho_1(a). $$

Immediate reward: The instant reward in state $i$ when action $a$ is chosen and $\theta$ is
observed is $r^a_{i\theta} = \theta a$.

Policy: $\alpha = [\alpha_1, \ldots, \alpha_T]$, where $\alpha_t$ is a mapping from the action and observation history
$\{\{A_1, \ldots, A_{t-1}\}, \{O_1, \ldots, O_{t-1}\}\}$ to the action space $A$.

In POMDPs the sufficient statistic of the action and observation history

$$ H_{t-1} = \{\{A_1, \ldots, A_{t-1}\}, \{O_1, \ldots, O_{t-1}\}\} $$

for choosing the optimal action at each time is the posterior probability of the state at time $t$.
This probability is referred to as the belief or information state and is defined as

$$ q_t = \Pr\{S_t = 1 \mid H_{t-1}\}, \tag{3} $$
$$ Q_t = [1 - q_t,\ q_t]. $$

Optimality equations: The optimal policy at each time step $t$ is a function of the current belief
$q_t$ and is defined as follows [22]. Let $V_t(q_t)$ be the maximum total expected reward obtained
from time steps $t + 1$ to T.

$$ V_T(q_T) = \max_{a_T} \Big\{ \sum_{i=0}^{1} Q_T(i) \sum_{\theta=0}^{1} h^{a_T}_{i\theta}\, r^{a_T}_{i\theta} \Big\}, \tag{4} $$

$$ V_t(q_t) = \max_{a_t} \Big\{ \sum_{i=0}^{1} Q_t(i) \sum_{\theta=0}^{1} h^{a_t}_{i\theta}\, r^{a_t}_{i\theta} + \sum_{\theta=0}^{1} \Pr\{O_t = \theta \mid a_t\}\, V_{t+1}(\mathcal{T}(q_t \mid a_t, \theta)) \Big\}, $$

where $\mathcal{T}(q_t \mid a_t, \theta) = q_{t+1}$ is called the belief update and is defined as

$$ q_{t+1} = \Pr\{S_{t+1} = 1 \mid q_t, a_t, \theta_t\} = \frac{q_t\, \rho_1(a_t)^{\theta_t} (1 - \rho_1(a_t))^{1 - \theta_t}}{q_t\, \rho_1(a_t)^{\theta_t} (1 - \rho_1(a_t))^{1 - \theta_t} + (1 - q_t)\, \rho_0(a_t)^{\theta_t} (1 - \rho_0(a_t))^{1 - \theta_t}}. \tag{5} $$

The value function $V_0(q_0)$ is equivalent to (1) and (2) stated earlier.
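
The belief update can be written in a few lines. The sketch below is a minimal Python rendering under the assumption that the two candidate demand models are available as functions; the linear models are the ones from the simulation section, and the function name `belief_update` is hypothetical.

```python
def belief_update(q, price, sold, rho0, rho1):
    """One step of the Bayesian belief update.

    q     : current belief that rho1 is the true demand model
    price : price offered at this step
    sold  : True if the sale succeeded, False otherwise
    """
    like1 = rho1(price) if sold else 1.0 - rho1(price)
    like0 = rho0(price) if sold else 1.0 - rho0(price)
    return q * like1 / (q * like1 + (1.0 - q) * like0)


# Illustrative demand models (assumed; taken from the simulation section).
rho0 = lambda p: 1.4 - 0.9 * p
rho1 = lambda p: 0.8 - 0.3 * p

# A sale at price 0.8 is more likely under rho0, so the belief in rho1 drops.
q_after_sale = belief_update(0.5, 0.8, True, rho0, rho1)
# At price 1.0 both models predict the same sale probability: no information.
q_uninformative = belief_update(0.5, 1.0, True, rho0, rho1)
```

At a price where both models agree, the belief does not move whatever the outcome, which is why such prices can stall learning.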

The optimal policy of the above formulated POMDP offers the maximum expected total
revenue. However, finding the optimal policy of a POMDP is PSPACE-hard in general [23].

Harrison et al. in [11] considered the suboptimal myopic policy and focused on whether finite
regret (i.e., complete learning) can be achieved rather than minimizing the exact value of the
expected regret. The myopic Bayesian policy (MBP) at each step picks the price that maximizes
the current expected revenue

$$ p_t = \arg\max_{p \in [l, u]} \{ q_t r_1(p) + (1 - q_t) r_0(p) \}, $$

where $q_t$ is the belief at time $t$ defined in (3) and (5). For any pricing policy that offers prices
from the range $[l, u]$, it was shown in [11] that the belief converges to a limit almost surely.

The limiting belief does not necessarily equal 0 or 1 (complete learning); it is possible
that the policy gets stuck at a so-called uninformative price. An uninformative price is a
price at which the two demand models $\rho_0(p)$ and $\rho_1(p)$ are equal. In order to deal with this issue,
$\delta$-discriminative policies were considered. In particular, a policy is $\delta$-discriminative if

$$ |\rho_0(\alpha_t(q)) - \rho_1(\alpha_t(q))| > \delta, \quad \forall t, $$

where $\alpha_t(q)$ denotes the price offered at time $t$ under belief $q$.
It was shown in [11] that if a policy is $\delta$-discriminative, the belief will converge to either 0 or
1 exponentially fast. Therefore, by restricting the MBP policy to be $\delta$-discriminative (referred to
as CMBP in [11]), finite regret can be achieved.

IV. The Non-Bayesian Approach

In this section we present our main result developed within the non-Bayesian framework. We
first consider N = 2 and leave the general case to Sec. V. We assume no prior probabilistic
knowledge on which demand model may pertain. We formulate this problem as a two-armed
bandit problem within the non-Bayesian framework as follows. Let

$$ p_k^* = \arg\max_{p \in [l, u]} p \rho_k(p), \quad k = 0, 1. $$

Arm $k$, $k = 0, 1$, is defined as the price $p_k^*$. If the underlying demand model is $\rho_k(\cdot)$, arm $k$ is the
better arm. Activating arm $k$ is defined as offering price $p_k^*$ to the customers. The reward random
variable $X_k$ for arm $k$ is defined as the revenue obtained by offering price $p_k^*$ at each time. The reward
mean of arm $k$ is $E[X_k] = p_k^* \rho_i(p_k^*)$ when $\rho_i(\cdot)$ is the unknown underlying demand model.
Throughout this paper we assume that no $p_k^*$ is an uninformative price under any underlying
demand model, meaning that for both $k = 0, 1$,

$$ \rho_1(p_k^*) \ne \rho_0(p_k^*). $$

This assumption is needed in order to achieve finite regret.
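
To make the assumption concrete, the short Python check below evaluates the gap between the two demand models at the two optimal prices and at a price where the models coincide. It is only a sketch: it assumes the linear demand models from the simulation section and uses the closed-form maximizers of $p(a - bp)$, namely $a/(2b)$, both of which lie inside $[0.5, 1.5]$.

```python
def rho0(p):
    return 1.4 - 0.9 * p


def rho1(p):
    return 0.8 - 0.3 * p


def gap(p):
    """Absolute difference between the two demand models at price p."""
    return abs(rho1(p) - rho0(p))


# Maximizer of p * (a - b * p) is a / (2 * b) when it lies inside the price range.
p0_star = 1.4 / (2 * 0.9)
p1_star = 0.8 / (2 * 0.3)

gaps = {
    "p0_star": gap(p0_star),  # strictly positive: informative
    "p1_star": gap(p1_star),  # strictly positive: informative
    "p = 1.0": gap(1.0),      # the two models coincide here: uninformative
}
```

For these models the only uninformative price in the range is $p = 1$, and neither optimal price equals it, so the assumption holds.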

Activating each arm (offering price $p_k^*$) gives i.i.d. realizations of the random reward $X_k$. Since
both arms share the same underlying demand model $\rho_i(\cdot)$, the arms are correlated. In other words,
observations from one arm also provide information on the quality of the other arm.

We define the regret, or revenue loss, as

$$ \Delta_i = T p_i^* \rho_i(p_i^*) - \sum_{k=0}^{1} p_k^* \rho_i(p_k^*)\, E[T_k], $$

where $T_k$ is the number of times that arm $k$ is selected, and $\rho_i(\cdot)$ is the true underlying demand
model.

A. The LRT Policy 

In this section, we propose the LRT policy and establish its finite regret. For each arm $k$, let
$Y_k$ denote the seller's observation when it activates arm $k$. It is a binary random variable with
mean $\rho_i(p_k^*)$. The LRT policy at each time step $t$ is a function $\alpha_t$ mapping from the observation
space $\{y_1, y_2, \ldots, y_{t-1}\}$ to the action space $\{0, 1\}$ (arms of the bandit). Specifically, in
the first step $t = 1$, the LRT policy chooses an arm $k \in \{0, 1\}$ by flipping a fair coin. For each
$t > 1$, let

$$ L(t) = \frac{1}{t} \sum_{j=1}^{t} \log \frac{f_1(y_j)}{f_0(y_j)}, \tag{6} $$

where $f_i(y_j) = \Pr\{Y_{\alpha_j} = y_j \mid \rho(\cdot) = \rho_i(\cdot)\}$ is the probability of observing $y_j$ when action $\alpha_j$ is
chosen if the underlying demand model were $\rho_i(\cdot)$. Then the LRT policy at each time step $t$
decides which arm to activate based on the following:

$$ \alpha_t = \begin{cases} 1, & L(t-1) \ge 0, \\ 0, & L(t-1) < 0, \end{cases} \tag{7} $$

where $\alpha_t$ denotes the action at time $t$. It is easy to see that $L(t)$ can be updated recursively as

$$ L(t) = \frac{1}{t} \Big[ (t-1) L(t-1) + \log \frac{f_1(y_t)}{f_0(y_t)} \Big]. $$

The LRT policy is based on the maximum likelihood detector. In the following theorem we show
that the LRT policy has finite regret.

Theorem 1: The LRT policy achieves a bounded regret.

Proof: The proof is a special case of the proof of Theorem 2, obtained by setting $\eta_0 = \eta_1 = 0$. ■
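
The LRT policy described above is simple enough to simulate directly. The following Python sketch is one possible rendering, assuming the linear demand models from the simulation section and their optimal prices $7/9$ and $4/3$; the function names (`log_lr`, `run_lrt`) and the use of a seeded `random.Random` are illustrative choices, not taken from the paper.

```python
import math
import random


def rho0(p):
    return 1.4 - 0.9 * p


def rho1(p):
    return 0.8 - 0.3 * p


def log_lr(price, sold):
    """log f_1(y) / f_0(y) for one Bernoulli sale observation at `price`."""
    if sold:
        return math.log(rho1(price) / rho0(price))
    return math.log((1.0 - rho1(price)) / (1.0 - rho0(price)))


def run_lrt(true_rho, prices, horizon, rng):
    """Simulate the LRT policy; prices = (p0_star, p1_star). Returns the arms played."""
    arms = []
    L = 0.0  # running average log-likelihood ratio L(t)
    for t in range(1, horizon + 1):
        if t == 1:
            arm = rng.randrange(2)      # fair coin flip at the first step
        else:
            arm = 1 if L >= 0 else 0    # maximum likelihood decision
        price = prices[arm]
        sold = rng.random() < true_rho(price)
        L = ((t - 1) * L + log_lr(price, sold)) / t  # recursive update of L(t)
        arms.append(arm)
    return arms


arms = run_lrt(rho1, (1.4 / 1.8, 0.8 / 0.6), horizon=500, rng=random.Random(0))
wrong_plays = arms.count(0)  # number of times the inferior arm was offered
```

Averaging `wrong_plays` over many seeds and several horizons gives an empirical view of the bounded-regret claim: the count should level off rather than keep growing with the horizon.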

B. The XLRT Policy 

In this section we propose a generalization of the LRT policy to improve its regret performance. 
Based on the underlying detection nature of the problem, we generalize the LRT policy by 
introducing an exploration price. We aim to choose a price as our exploration price in order to 
accelerate the learning of the underlying demand model. 

In particular, this exploration price should be chosen such that $\rho_1(p)$ and $\rho_0(p)$ are most easily
distinguished from random observations. Recognizing the detection nature of the problem, we
adopt the Chernoff distance [24], which measures the distance between two distributions by the
asymptotic exponential decay rate of the probability of detection errors. Specifically, for two
probability density functions $f_0$ and $f_1$, the Chernoff distance is given by

$$ C(f_0, f_1) = \max_{0 \le t \le 1} -\log \mu(t), \tag{8} $$
$$ \mu(t) = \int [f_0(x)]^{1-t} [f_1(x)]^{t} \, dx. $$

Since calculating the Chernoff distance involves an optimization step, obtaining an analytical
solution can be tedious. Johnson and Sinanovic in [25] stated that the harmonic average of the
Kullback-Leibler divergences [26] of $f_0$ and $f_1$, which is easy to calculate, can be a good
approximation of the Chernoff distance. Namely,

$$ C(f_0, f_1) \approx \frac{1}{\dfrac{1}{I(f_0 \| f_1)} + \dfrac{1}{I(f_1 \| f_0)}}, \tag{9} $$

where the Kullback-Leibler divergence of $f_0$ with respect to $f_1$ is given by

$$ I(f_0 \| f_1) = \int f_0(x) \log \frac{f_0(x)}{f_1(x)} \, dx. \tag{10} $$

Note that the Kullback-Leibler divergence is not symmetric and the Chernoff distance can be thought
of as a symmetrized Kullback-Leibler divergence.
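
For sale observations the two distributions are Bernoulli, so the integrals reduce to two-term sums. The sketch below is a hedged Python illustration of the Kullback-Leibler divergence, the harmonic-type approximation, and a grid evaluation of the Chernoff distance for two Bernoulli means; the function names and the grid resolution are assumptions made for the example.

```python
import math


def kl_bernoulli(a, b):
    """I(a || b) for Bernoulli distributions with means a, b in (0, 1)."""
    return a * math.log(a / b) + (1.0 - a) * math.log((1.0 - a) / (1.0 - b))


def chernoff_approx(a, b):
    """Approximation 1 / (1/I(a||b) + 1/I(b||a)); requires a != b."""
    return 1.0 / (1.0 / kl_bernoulli(a, b) + 1.0 / kl_bernoulli(b, a))


def chernoff_grid(a, b, steps=1000):
    """Grid evaluation of max over t in (0, 1) of -log mu(t) for two Bernoulli laws."""
    best = 0.0
    for n in range(1, steps):
        t = n / steps
        mu = a ** (1.0 - t) * b ** t + (1.0 - a) ** (1.0 - t) * (1.0 - b) ** t
        best = max(best, -math.log(mu))
    return best


# Sale probabilities 0.4 and 0.2 (the two linear models evaluated at price 4/3).
kl = kl_bernoulli(0.4, 0.2)
approx = chernoff_approx(0.4, 0.2)
exact = chernoff_grid(0.4, 0.2)
```

Unlike the divergence itself, both `chernoff_approx` and `chernoff_grid` are symmetric in their two arguments, which matches the remark that the Chernoff distance acts as a symmetrized divergence.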

The exploration price is thus chosen as the price that maximizes the Chernoff distance
$C(\rho_1(p), \rho_0(p))$. Observations obtained at this price are the most informative in distinguishing
the two possible candidates of the demand model. The exploration price is offered when the
log-likelihood ratio $L(t)$ is close to 0, i.e., when it is most uncertain which demand model pertains.
This is done by introducing two thresholds, $\eta_1$ and $-\eta_0$, instead of the single threshold in the
LRT policy (see (7)). The resulting policy is referred to as XLRT, as detailed below.

Set the exploration price as

$$ p_x = \arg\max_{p \in [l, u]} C(\rho_1(p), \rho_0(p)). \tag{11} $$

At each step $t$:

$$ p_t = \begin{cases} p_0^*, & L(t-1) < -\eta_0, \\ p_x, & -\eta_0 \le L(t-1) < \eta_1, \\ p_1^*, & \eta_1 \le L(t-1), \end{cases} \tag{12} $$

where $L(t-1)$ is the log-likelihood ratio given in (6) and the two thresholds should satisfy the
following conditions:

$$ \eta_1 < \min\{ I(\rho_1(p_1^*) \| \rho_0(p_1^*)),\ I(\rho_1(p_x) \| \rho_0(p_x)),\ I(\rho_1(p_0^*) \| \rho_0(p_0^*)) \}, $$
$$ \eta_0 < \min\{ I(\rho_0(p_1^*) \| \rho_1(p_1^*)),\ I(\rho_0(p_x) \| \rho_1(p_x)),\ I(\rho_0(p_0^*) \| \rho_1(p_0^*)) \}. $$

This policy differs from the LRT policy only when the likelihood ratio $L(t)$ is close to zero
(indicating a greater degree of uncertainty on the underlying demand model). At such time
instants, XLRT chooses the price $p_x$ that is most informative in learning the underlying demand
model. Simulation examples demonstrate that the XLRT policy improves the performance of the
LRT policy (see Sec. VI).
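
The three-way pricing rule and the threshold conditions can be sketched as follows. This is an illustrative Python fragment, not the authors' implementation: it assumes the linear demand models from the simulation section, and the exploration price `p_x = 1.5` is a placeholder chosen for the example rather than computed from the Chernoff-distance maximization.

```python
import math


def rho0(p):
    return 1.4 - 0.9 * p


def rho1(p):
    return 0.8 - 0.3 * p


def kl_bernoulli(x, y):
    """Kullback-Leibler divergence between Bernoulli means x and y."""
    return x * math.log(x / y) + (1.0 - x) * math.log((1.0 - x) / (1.0 - y))


def threshold_bounds(prices):
    """Strict upper bounds for (eta0, eta1): minimum divergence over the prices used."""
    eta1_max = min(kl_bernoulli(rho1(p), rho0(p)) for p in prices)
    eta0_max = min(kl_bernoulli(rho0(p), rho1(p)) for p in prices)
    return eta0_max, eta1_max


def xlrt_price(L_prev, p0_star, p1_star, p_x, eta0, eta1):
    """Price offered by XLRT given the previous log-likelihood ratio L(t-1)."""
    if L_prev < -eta0:
        return p0_star  # confident in model 0
    if L_prev < eta1:
        return p_x      # uncertain: explore
    return p1_star      # confident in model 1


p0_star, p1_star, p_x = 1.4 / 1.8, 0.8 / 0.6, 1.5  # p_x is a placeholder value
eta0_max, eta1_max = threshold_bounds((p0_star, p_x, p1_star))
eta0, eta1 = 0.5 * eta0_max, 0.5 * eta1_max  # any values strictly below the bounds
```

Setting both thresholds to zero removes the exploration band, and the rule reduces to the two-way LRT decision.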

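Theorem 2: The XLRT policy achieves a bounded regret.
Proof: Without loss of generality, we assume that the underlying demand model is $\rho_1(\cdot)$.
Let $M_e$ denote the expected number of times that the XLRT policy chooses a non-optimal
price. Then

$$ M_e = E\Big[ \sum_{t=1}^{\infty} 1\Big\{ \frac{1}{t} \sum_{j=1}^{t} \log \frac{f_1(y_j)}{f_0(y_j)} < \eta_1 \Big\} \Big] \le \sum_{t=1}^{\infty} E\Big[ 1\Big\{ \frac{1}{t} \sum_{j=1}^{t} \log \frac{f_1(y_j)}{f_0(y_j)} < \eta_1 \Big\} \Big], $$

where $1\{\cdot\}$ is the indicator function. Note that both the action $\alpha_j$ and the observation $y_j$ at time
$j$ are random variables. To simplify the notation, let $Z_j^{\alpha_j} = \log \frac{f_1(y_j)}{f_0(y_j)}$. We then have

$$ \Pr\Big\{ \frac{1}{t} \sum_{j=1}^{t} Z_j^{\alpha_j} < \eta_1 \Big\} = E\Big[ \Pr\Big\{ \frac{1}{t} \sum_{j=1}^{t} Z_j^{\alpha_j} < \eta_1 \,\Big|\, \{\alpha_j\}_{j=1}^{t} \Big\} \Big], $$

where the expectation is over the action sequence $\{\alpha_j\}_{j=1}^{t}$.

Notice that the action $\alpha_j$ at each time $j$ takes three possible values: $\{p_0^*, p_x, p_1^*\}$. We label these
three prices as $p(1) = p_0^*$, $p(2) = p_x$, and $p(3) = p_1^*$, and let $Z_j^k = \log \frac{f_1(y_j)}{f_0(y_j)}$
denote the log-likelihood ratio conditioned on price $p(k)$ ($k = 1, 2, 3$) being chosen at time $j$. It is
easy to see that for each $k = 1, 2, 3$, the $Z_j^k$'s are i.i.d. binary random variables taking the values
$\log \frac{1 - \rho_1(p(k))}{1 - \rho_0(p(k))}$ and $\log \frac{\rho_1(p(k))}{\rho_0(p(k))}$ with probabilities $1 - \rho_1(p(k))$ and $\rho_1(p(k))$, respectively. Since the
underlying demand model is $\rho_1(\cdot)$, we have

$$ E[Z_j^k] = E\Big[ \log \frac{f_1(y_j)}{f_0(y_j)} \Big] = I(\rho_1(p(k)) \| \rho_0(p(k))) > 0, $$

where $I(x \| y)$ is the Kullback-Leibler divergence of two Bernoulli random variables with means
$x$ and $y$.

Let $m_k = \min\Big\{ \log \frac{1 - \rho_1(p(k))}{1 - \rho_0(p(k))}, \log \frac{\rho_1(p(k))}{\rho_0(p(k))} \Big\}$, $M_k = \max\Big\{ \log \frac{1 - \rho_1(p(k))}{1 - \rho_0(p(k))}, \log \frac{\rho_1(p(k))}{\rho_0(p(k))} \Big\}$, $m = \min\{m_1, m_2, m_3\}$,
and $M = \max\{M_1, M_2, M_3\}$. The $Z_j^k$'s are independent and bounded in the interval $[m, M]$ for all
$k = 1, 2, 3$.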



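Let $\Theta_t^k = \{ j : \alpha_j = p(k),\ j \le t \}$ for $k \in \{1, 2, 3\}$. Then

$$ \begin{aligned}
\Pr\Big\{ \frac{1}{t} \sum_{j=1}^{t} Z_j^{\alpha_j} < \eta_1 \Big\}
&= E\Big[ \Pr\Big\{ \frac{1}{t} \sum_{j=1}^{t} Z_j^{\alpha_j} < \eta_1 \,\Big|\, \{\alpha_j\}_{j=1}^{t} \Big\} \Big] \\
&= E\Big[ \Pr\Big\{ \frac{1}{t} \sum_{k=1}^{3} \sum_{h \in \Theta_t^k} Z_h^k < \eta_1 \,\Big|\, \{\alpha_j\}_{j=1}^{t},\ \cup_{k=1}^{3} \Theta_t^k \Big\} \Big] \\
&= E\Big[ \Pr\Big\{ \frac{1}{t} \sum_{k=1}^{3} \sum_{h \in \Theta_t^k} Z_h^k < \eta_1 \,\Big|\, \cup_{k=1}^{3} \Theta_t^k \Big\} \Big] \\
&= E\Big[ \Pr\Big\{ \frac{1}{t} \sum_{k=1}^{3} \sum_{h \in \Theta_t^k} (Z_h^k - E[Z_h^k]) < -\Big( \frac{1}{t} \sum_{k=1}^{3} \sum_{h \in \Theta_t^k} E[Z_h^k] - \eta_1 \Big) \,\Big|\, \cup_{k=1}^{3} \Theta_t^k \Big\} \Big].
\end{aligned} $$

The second equality holds because the sigma-algebra generated by $\cup_{k=1}^{3} \Theta_t^k$ is a subset of the
sigma-algebra generated by $\{\alpha_j\}_{j=1}^{t}$. The third equality holds because the conditional event
$\Big\{ \frac{1}{t} \sum_{k=1}^{3} \sum_{h \in \Theta_t^k} Z_h^k < \eta_1 \,\Big|\, \cup_{k=1}^{3} \Theta_t^k \Big\}$ is independent of $\{\alpha_j\}_{j=1}^{t}$.

Note that conditioned on $\cup_{k=1}^{3} \Theta_t^k$, the sum $\sum_{k=1}^{3} \sum_{h \in \Theta_t^k} (Z_h^k - E[Z_h^k])$ is the sum of $t$ independent
zero-mean random variables. The rest of the proof is based on the following Hoeffding's
inequality.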

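Hoeffding's inequality: Let the $X_i$'s be independent random variables that are almost surely
bounded in $[m_i, M_i]$, i.e., $\Pr\{X_i \in [m_i, M_i]\} = 1$. Then for any $a > 0$,

$$ \Pr\Big\{ \frac{1}{t} \sum_{i=1}^{t} (X_i - E[X_i]) \le -a \Big\} \le \exp\Big\{ -\frac{2 a^2 t^2}{\sum_{i=1}^{t} (M_i - m_i)^2} \Big\}. \tag{13} $$

Therefore Hoeffding's inequality can be applied to the conditional probability inside the
expectation, for the independent random variables $Z_h^k$ that are bounded in $[m, M]$, to obtain

$$ \begin{aligned}
&\Pr\Big\{ \frac{1}{t} \sum_{k=1}^{3} \sum_{h \in \Theta_t^k} (Z_h^k - E[Z_h^k]) < -\Big( \frac{1}{t} \sum_{k=1}^{3} \sum_{h \in \Theta_t^k} E[Z_h^k] - \eta_1 \Big) \,\Big|\, \cup_{k=1}^{3} \Theta_t^k \Big\} \\
&\quad \le \Pr\Big\{ \frac{1}{t} \sum_{k=1}^{3} \sum_{h \in \Theta_t^k} (Z_h^k - E[Z_h^k]) < -\Big( \min_{k=1,2,3} I(\rho_1(p(k)) \| \rho_0(p(k))) - \eta_1 \Big) \,\Big|\, \cup_{k=1}^{3} \Theta_t^k \Big\} \\
&\quad = \Pr\Big\{ \frac{1}{t} \sum_{k=1}^{3} \sum_{h \in \Theta_t^k} (Z_h^k - E[Z_h^k]) < -a \,\Big|\, \cup_{k=1}^{3} \Theta_t^k \Big\} \le \exp\Big\{ -\frac{2 a^2 t}{(M - m)^2} \Big\} = \exp\Big\{ -\frac{t}{C} \Big\},
\end{aligned} $$

where $a = \min_{k=1,2,3} I(\rho_1(p(k)) \| \rho_0(p(k))) - \eta_1$ and $C = \frac{(M - m)^2}{2 a^2}$. Recall that the thresholds are
chosen such that $\eta_1 < \min_{k=1,2,3} I(\rho_1(p(k)) \| \rho_0(p(k)))$ by definition. Therefore $a > 0$. Hence

$$ \Pr\Big\{ \frac{1}{t} \sum_{j=1}^{t} Z_j^{\alpha_j} < \eta_1 \Big\} = E\Big[ \Pr\Big\{ \frac{1}{t} \sum_{j=1}^{t} Z_j^{\alpha_j} < \eta_1 \,\Big|\, \{\alpha_j\}_{j=1}^{t} \Big\} \Big] \le E\Big[ \Pr\Big\{ \frac{1}{t} \sum_{k=1}^{3} \sum_{h \in \Theta_t^k} (Z_h^k - E[Z_h^k]) < -a \,\Big|\, \cup_{k=1}^{3} \Theta_t^k \Big\} \Big] \le E\Big[ \exp\Big\{ -\frac{t}{C} \Big\} \Big] = \exp\Big\{ -\frac{t}{C} \Big\}. $$

Hence

$$ M_e \le \sum_{t=1}^{\infty} \Pr\Big\{ \frac{1}{t} \sum_{j=1}^{t} \log \frac{f_1(y_j)}{f_0(y_j)} < \eta_1 \Big\} \le \sum_{t=1}^{\infty} \exp\Big\{ -\frac{t}{C} \Big\} \le \int_{0}^{\infty} \exp\Big\{ -\frac{t}{C} \Big\} \, dt = C < \infty. $$

This shows that the expected number of times a non-optimal price is chosen is bounded by the
finite number C. Therefore the regret of the XLRT policy is bounded above by C multiplied by a
constant. ■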
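
The final bound can also be checked directly, without the integral comparison. A short worked derivation, assuming only $C > 0$ and the elementary inequality $e^{x} - 1 \ge x$ for $x \ge 0$:

```latex
\sum_{t=1}^{\infty} e^{-t/C}
  = \frac{e^{-1/C}}{1 - e^{-1/C}}
  = \frac{1}{e^{1/C} - 1}
  \le \frac{1}{1/C}
  = C .
```

So the geometric series itself is at most $C$, in agreement with the integral bound used in the proof.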

V. Extension to Finite Space Demand Uncertainty 

In this section we extend the LRT and XLRT policies to handle finite-space demand uncertainty,
where the underlying demand model is taken from a set $\{\rho_0(\cdot), \ldots, \rho_{N-1}(\cdot)\}$ with an arbitrary
finite cardinality N.

This problem can be formulated as a multi-armed bandit problem in the same way as in
Section IV. Arm $k$, $k = 0, \ldots, N-1$, is defined as the price $p_k^*$, where

$$ p_k^* = \arg\max_{p \in [l, u]} p \rho_k(p), \quad k = 0, \ldots, N-1. $$

If the underlying demand model is $\rho_k(\cdot)$, arm $k$ is the best arm. As mentioned earlier, in order to
achieve finite regret we assume that no optimal price is uninformative under any demand model.
In other words, for all $j, h, k \in \{0, \ldots, N-1\}$ with $j \ne h$,

$$ \rho_j(p_k^*) \ne \rho_h(p_k^*). $$

The regret is also defined in a similar way as

$$ \Delta_i = T p_i^* \rho_i(p_i^*) - \sum_{k=0}^{N-1} p_k^* \rho_i(p_k^*)\, E[T_k], $$

where $T_k$ is the number of times that arm $k$ is selected.

We present the extended versions of LRT and XLRT (referred to as ELRT and EXLRT) 
policies for an arbitrary number $N$ of potential demand models. We also show that the proposed 
ELRT policy achieves finite regret. Similarly, for each arm $k$, define the binary random variable 
$Y_k \in \{0, 1\}$ with mean $p_i(p_k^*)$. $Y_k$ is the random variable indicating the seller's observation when 
it activates arm $k$ (i.e., offers price $p_k^*$). The ELRT policy at each time step $t$ is a function $a_t$ 
mapping from the observation space $\{y_1, y_2, \ldots, y_{t-1}\}$ to the action space $\{0, \ldots, N-1\}$ 
(arms of the bandit). For the first step $t = 1$, choose an arm uniformly from $k \in \{0, \ldots, N-1\}$. 
For the time step $t > 1$, let 

$$L_{k,h}(t) = \sum_{j=1}^{t} \log \frac{f_k(y_j)}{f_h(y_j)}, \tag{14}$$

where $f_i(y_j) = P\{Y_{a_j} = y_j \mid p(\cdot) = p_i(\cdot)\}$ is the probability of observing $y_j$ when action $a_j$ is 
chosen if the underlying demand model is $p_i(\cdot)$. 

The ELRT policy at each time step $t$ selects arm $a_t = k$ for which 

$$L_{k,h}(t-1) > 0, \quad \forall h \in \{0, \ldots, N-1\},\ h \ne k. \tag{15}$$

Theorem 3: The ELRT policy achieves a bounded regret. 
Proof: The proof is a direct generalization of the proof of Theorem 1 and is given in the Appendix for 
completeness. 

Similarly, we can introduce exploration prices to improve the rate of learning. Since there are 
now $N$ possible demand models, there are $N - 1$ possible exploration prices. At each time, 
when the log-likelihood ratio between the two models that are detected as the most likely is close 
to 0, the exploration price defined by the Chernoff distance between these two demand models 
is offered. Specifically, let 

$$p^e_{i,h} = \arg\max_{p \in [l,u]} \{ C(p_i(p), p_h(p)) \}. \tag{16}$$

At each time step $t$ the policy first finds the two most probable demand models $d_1$ and $d_2$, where 
model $d_1$ satisfies 

$$L_{d_1,h}(t-1) > 0, \quad \forall h \in \{0, \ldots, N-1\},\ h \ne d_1, \tag{17}$$

and model $d_2$ satisfies 

$$L_{d_2,h}(t-1) > 0, \quad \forall h \in \{0, \ldots, N-1\},\ h \ne d_1,\ h \ne d_2. \tag{18}$$

Then the price offered at time $t$ is 

$$p_t = \begin{cases} p^*_{d_1} & \text{if } L_{d_1,d_2}(t-1) > \eta_{d_1,d_2}, \\ p^e_{d_1,d_2} & \text{if } 0 < L_{d_1,d_2}(t-1) \le \eta_{d_1,d_2}, \end{cases}$$

where the threshold $\eta_{d_1,d_2}$ is chosen to satisfy the following condition: 

$$\eta_{d_1,d_2} < \min\{ I(p_{d_1}(p^*_{d_1}), p_{d_2}(p^*_{d_1})),\ I(p_{d_1}(p^e_{d_1,d_2}), p_{d_2}(p^e_{d_1,d_2})),\ I(p_{d_1}(p^*_{d_2}), p_{d_2}(p^*_{d_2})) \}.$$
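The arm-selection rule of the ELRT policy can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, the third demand model and the tie-breaking rule are made up for the example, and the optimal prices use the closed form $a/(2b)$ that holds for a linear demand $a - bp$. Keeping one cumulative log-likelihood score per model is enough, because $L_{k,h}$ is the difference of the scores of models $k$ and $h$.

```python
import math
import random


def log_prob(outcome, sale_prob):
    """Log-probability of a Bernoulli sale outcome (1 = sale, 0 = no sale)."""
    return math.log(sale_prob) if outcome == 1 else math.log(1.0 - sale_prob)


def most_likely_model(scores):
    """Index k with the largest cumulative log-likelihood.

    L_{k,h} = scores[k] - scores[h], so a strict maximizer satisfies
    L_{k,h} > 0 for every h != k. Ties go to the smallest index (an assumption).
    """
    return max(range(len(scores)), key=lambda k: scores[k])


def run_elrt(models, optimal_prices, true_index, horizon, rng):
    """Simulate the arm choices of the likelihood-based rule for `horizon` steps."""
    scores = [0.0] * len(models)
    choices = []
    for t in range(horizon):
        if t == 0:
            k = rng.randrange(len(models))  # first step: uniform choice
        else:
            k = most_likely_model(scores)
        price = optimal_prices[k]
        outcome = 1 if rng.random() < models[true_index](price) else 0
        # Update every model's score with the likelihood of the new observation.
        for h, model in enumerate(models):
            scores[h] += log_prob(outcome, model(price))
        choices.append(k)
    return choices


# Hypothetical three-model example; the third model is made up for illustration.
models = [
    lambda p: 1.4 - 0.9 * p,
    lambda p: 0.8 - 0.3 * p,
    lambda p: 1.2 - 0.5 * p,
]
# For linear demand a - b*p the revenue p*(a - b*p) peaks at a / (2*b).
optimal_prices = [1.4 / 1.8, 0.8 / 0.6, 1.2 / 1.0]

choices = run_elrt(models, optimal_prices, true_index=0, horizon=2000,
                   rng=random.Random(0))
print("fraction of steps on the true model's price:", choices.count(0) / len(choices))
```

The sketch omits the exploration price; adding it would only change which price is offered when the gap between the two largest scores is below the threshold.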

VI. Simulations 

In this section we study the performance of the proposed policies. Fig. 1 and Fig. 2 show 
the comparison of the LRT policy with the constrained MBP (CMBP) proposed in [11] under the 
binary demand model sets chosen in [11]. In these two examples, we consider the same demand 
models used in the simulation studies in [11]. In particular, in Fig. 1, $p_0(p) = 1.4 - 0.9p$ and 
$p_1(p) = 0.8 - 0.3p$ for the price range $[0.5, 1.5]$, and in Fig. 2, $p_0(p) = \frac{1}{1+\exp(-10+10p)}$ and 
$p_1(p) = \frac{1}{1+\exp(-1+0.5p)}$ for the price range $[0, 4]$. The initial belief is chosen to be $q = 0.5$. 
We observe that the proposed LRT policy outperforms CMBP in both cases. 
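To make the linear example concrete, the following sketch computes the revenue-maximizing price of each linear demand model on $[0.5, 1.5]$ and the expected revenue lost per step when the price of the wrong model is offered. The helper name `optimal_price` and the grid resolution are assumptions made for illustration; for a linear demand $a - bp$ the result can be checked against the closed form $a/(2b)$.

```python
def optimal_price(demand, low, high, steps=100_000):
    """Grid search for the price in [low, high] maximizing expected revenue p * demand(p)."""
    grid = (low + (high - low) * n / steps for n in range(steps + 1))
    return max(grid, key=lambda p: p * demand(p))


def p0(p):
    return 1.4 - 0.9 * p


def p1(p):
    return 0.8 - 0.3 * p


price0 = optimal_price(p0, 0.5, 1.5)
price1 = optimal_price(p1, 0.5, 1.5)

# Expected revenue lost per step when model 0 is true but model 1's price is offered.
loss_if_p0_true = price0 * p0(price0) - price1 * p0(price1)
# The reverse case: model 1 is true but model 0's price is offered.
loss_if_p1_true = price1 * p1(price1) - price0 * p1(price0)

print(price0, price1, loss_if_p0_true, loss_if_p1_true)
```

Regret accumulates only on the steps where the wrong price is offered, so a bounded expected number of such steps translates into bounded regret.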




Fig. 1. CMBP vs. LRT: Case 1 (horizontal axis: time horizon)

Fig. 2. CMBP vs. LRT: Case 2

Fig. 3 and Fig. 4 show the comparison of the XLRT policy with LRT for the demand models 
$p_0(p) = 1.4 - 0.9p$ and $p_1(p) = 0.8 - 0.3p$ when the underlying demand model is $p_0(\cdot)$ and $p_1(\cdot)$, 
respectively. We observe that introducing the exploration price in XLRT considerably improves 
the regret. 
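The exploration price for this pair of linear demand models can be approximated numerically. The sketch below assumes the standard Chernoff information between two Bernoulli distributions, $C(a, b) = -\min_{0 \le \lambda \le 1} \log\left(a^{\lambda} b^{1-\lambda} + (1-a)^{\lambda} (1-b)^{1-\lambda}\right)$, as the Chernoff distance; the function names and both grid resolutions are hypothetical choices for illustration.

```python
import math


def chernoff_distance(a, b, grid=200):
    """Chernoff information between Bernoulli(a) and Bernoulli(b), for a, b in (0, 1).

    The minimum over lam in [0, 1] is approximated on a uniform grid.
    """
    def log_mix(lam):
        return math.log(a ** lam * b ** (1 - lam)
                        + (1 - a) ** lam * (1 - b) ** (1 - lam))
    return -min(log_mix(n / grid) for n in range(grid + 1))


def exploration_price(demand_a, demand_b, low, high, steps=100):
    """Price in [low, high] at which the two demand models are easiest to tell apart."""
    prices = [low + (high - low) * n / steps for n in range(steps + 1)]
    return max(prices, key=lambda p: chernoff_distance(demand_a(p), demand_b(p)))


def p0(p):
    return 1.4 - 0.9 * p


def p1(p):
    return 0.8 - 0.3 * p


price = exploration_price(p0, p1, 0.5, 1.5)
print("exploration price:", price)
print("Chernoff distance at p = 1.0:", chernoff_distance(p0(1.0), p1(1.0)))
```

At $p = 1$ both models predict a sale probability of 0.5, so an observation at that price carries no information about which model is true; the exploration price moves away from it.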

VII. Conclusion 

We considered the dynamic pricing problem in which the underlying demand model is unknown 
to the seller and takes one of $N$ possible forms, where $N$ is an arbitrary number. A non-Bayesian 
approach, which requires no prior knowledge of which demand model governs the market, is 
studied. The problem is formulated as a multi-armed bandit with 
correlated arms. A policy based on the likelihood ratio test is developed that achieves finite 
regret. A generalization of this policy is proposed by introducing an exploration price that helps 
the seller learn the underlying demand model faster and improves the regret performance. 

Fig. 3. XLRT vs. LRT when $p = p_0$

Fig. 4. XLRT vs. LRT when $p = p_1$

Appendix: Proof of Theorem 3 
Without loss of generality, we assume that the underlying demand model is $p_i(\cdot)$. 
Let $M_e$ denote the expected number of times that the ELRT policy chooses the non-optimal 
price. Then 

$$M_e \le \sum_{h=0, h \ne i}^{N-1} \sum_{t=1}^{T} P\left\{ \frac{1}{t} \sum_{j=1}^{t} \log \frac{f_i^{a_j}(y_j)}{f_h^{a_j}(y_j)} < 0 \right\} \le \sum_{h=0, h \ne i}^{N-1} \sum_{t=1}^{\infty} P\left\{ \frac{1}{t} \sum_{j=1}^{t} \log \frac{f_i^{a_j}(y_j)}{f_h^{a_j}(y_j)} < 0 \right\},$$

where the superscript in $f_i^{a_j}(y_j)$ makes explicit the action $a_j$ under which $y_j$ is observed. 

Note that both the action $a_j$ and the observation $y_j$ at time $j$ are random variables. To simplify 
the notation, consider a fixed $h$ and let $Z_j = \log \frac{f_i^{a_j}(y_j)}{f_h^{a_j}(y_j)}$. At each time $j$, the policy $a_j$ chooses 
one of the $N$ possible prices $p_k^*$, $k = 0, \ldots, N-1$. For each chosen price $p_k^*$, $y_j$ is a Bernoulli 
random variable. Let $Z_j^k = \log \frac{f_i^k(y_j)}{f_h^k(y_j)}$ denote the log-likelihood ratio conditioned 
on price $p_k^*$ being chosen. It is easy to see that for each $k = 0, \ldots, N-1$, the $Z_j^k$'s are i.i.d. binary 
random variables taking the values $\left\{ \log \frac{1 - p_i(p_k^*)}{1 - p_h(p_k^*)}, \log \frac{p_i(p_k^*)}{p_h(p_k^*)} \right\}$ with probabilities $1 - p_i(p_k^*)$ and $p_i(p_k^*)$, 
respectively. Since the underlying demand model is $p_i(\cdot)$, we have 

$$E[Z_j^k] = E\left[ \log \frac{f_i^k(y_j)}{f_h^k(y_j)} \right] = I(p_i(p_k^*) \| p_h(p_k^*)) > 0.$$
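The identity between the mean of the two-valued log-likelihood ratio and the Kullback-Leibler divergence can be checked numerically. The sketch below uses the sale probabilities that the linear models $p_0(p) = 1.4 - 0.9p$ and $p_1(p) = 0.8 - 0.3p$ assign to the price $1.4/1.8$; the function names are hypothetical.

```python
import math


def kl_bernoulli(a, b):
    """Kullback-Leibler divergence I(a || b) between Bernoulli(a) and Bernoulli(b)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))


def expected_llr(true_prob, alt_prob):
    """Mean of the two-valued log-likelihood ratio when the true sale probability is true_prob."""
    value_no_sale = math.log((1 - true_prob) / (1 - alt_prob))  # taken with prob. 1 - true_prob
    value_sale = math.log(true_prob / alt_prob)                 # taken with prob. true_prob
    return (1 - true_prob) * value_no_sale + true_prob * value_sale


price = 1.4 / 1.8
true_prob = 1.4 - 0.9 * price  # sale probability under the true model
alt_prob = 0.8 - 0.3 * price   # sale probability under the alternative model

print(expected_llr(true_prob, alt_prob), kl_bernoulli(true_prob, alt_prob))
```

The mean is strictly positive whenever the two models assign different sale probabilities to the offered price, which is what drives the cumulative log-likelihood ratio away from zero.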

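Let $m_h^k = \min\left\{ \log \frac{1 - p_i(p_k^*)}{1 - p_h(p_k^*)}, \log \frac{p_i(p_k^*)}{p_h(p_k^*)} \right\}$, $M_h^k = \max\left\{ \log \frac{1 - p_i(p_k^*)}{1 - p_h(p_k^*)}, \log \frac{p_i(p_k^*)}{p_h(p_k^*)} \right\}$, $m_h = \min\{m_h^0, \ldots, m_h^{N-1}\}$, 
and $M_h = \max\{M_h^0, \ldots, M_h^{N-1}\}$. The $Z_j$'s are independent and bounded in the interval $[m_h, M_h]$. 
Following the steps in the proof of Theorem 2 for $\eta_0 = \eta_1 = 0$ and conditioning on the 
sufficient statistic for the action history $\{a_j\}_{j=1}^{t}$, one can get 

$$P\left\{ \frac{1}{t} \sum_{j=1}^{t} \log \frac{f_i^{a_j}(y_j)}{f_h^{a_j}(y_j)} < 0 \right\} \le \exp\left\{ -\frac{t}{C_h} \right\},$$

where $C_h = \frac{(M_h - m_h)^2}{2\alpha_h^2}$ and $\alpha_h = \min_{k=0,\ldots,N-1} I(p_i(p_k^*) \| p_h(p_k^*))$. Hence 

$$M_e \le \sum_{h=0, h \ne i}^{N-1} \sum_{t=1}^{\infty} P\left\{ \frac{1}{t} \sum_{j=1}^{t} \log \frac{f_i^{a_j}(y_j)}{f_h^{a_j}(y_j)} < 0 \right\} \le \sum_{h=0, h \ne i}^{N-1} \sum_{t=1}^{\infty} \exp\left\{ -\frac{t}{C_h} \right\} \le \sum_{h=0, h \ne i}^{N-1} \int_{0}^{\infty} \exp\left\{ -\frac{t}{C_h} \right\} dt \le (N-1)C < \infty,$$

where $C = \max_{h \ne i}\{C_h\}$. 
This shows that the expected number of times the non-optimal price is chosen is bounded by the 
finite number $(N-1)C$. Therefore the regret of the ELRT policy is bounded above by $(N-1)C$ 
multiplied by a constant. 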
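As a rough numerical illustration of the size of such a constant, the sketch below evaluates a Hoeffding-type constant $C_h = (M_h - m_h)^2 / (2\alpha_h^2)$ for the two linear demand models of the simulation section, with $p_0$ taken as the true model. The exact form of $C_h$ is an assumption here (it is the constant that Hoeffding's inequality gives for independent variables bounded in $[m_h, M_h]$ with mean at least $\alpha_h$), and the function names are hypothetical.

```python
import math


def p0(price):
    return 1.4 - 0.9 * price


def p1(price):
    return 0.8 - 0.3 * price


def kl_bernoulli(a, b):
    """Kullback-Leibler divergence between Bernoulli(a) and Bernoulli(b)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))


def hoeffding_constant(true_model, alt_model, prices):
    """C_h = (M_h - m_h)^2 / (2 * alpha_h^2), assuming a Hoeffding-type bound."""
    values = []
    for price in prices:
        a, b = true_model(price), alt_model(price)
        values.append(math.log((1 - a) / (1 - b)))  # log-likelihood ratio after no sale
        values.append(math.log(a / b))              # log-likelihood ratio after a sale
    m_h, big_m_h = min(values), max(values)
    alpha_h = min(kl_bernoulli(true_model(p), alt_model(p)) for p in prices)
    return (big_m_h - m_h) ** 2 / (2 * alpha_h ** 2)


prices = [1.4 / 1.8, 0.8 / 0.6]  # optimal prices of the two linear models
c_h = hoeffding_constant(p0, p1, prices)
# Closed form of the geometric series: sum over t >= 1 of exp(-t / c_h).
series = 1.0 / math.expm1(1.0 / c_h)
print(c_h, series)
```

The series never exceeds $C_h$ because $e^{x} - 1 \ge x$, which mirrors the step that bounds the sum over $t$ by the integral.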

References 

[1] M. Rothschild, "A Two-armed Bandit Theory of Market Pricing", in Journal of Economic Theory, pp. 185-202, 1974. 
[2] R. Bellman, "A Problem in the Sequential Design of Experiments", in Sankhya: The Indian Journal of Statistics, vol. 16, no. 3/4, April 1956. 
[3] J. C. Gittins, "Bandit processes and dynamic allocation indices", in Journal of the Royal Statistical Society. Series B, vol. 41, pp. 148-177, 1979. 
[4] T. L. Lai and H. Robbins, "Asymptotically efficient adaptive allocation rules", in Advances in Applied Mathematics, vol. 6, no. 1, pp. 4-22, 1985. 
[5] R. Agrawal, "The Continuum-Armed Bandit Problem", in SIAM J. Control and Optimization, vol. 33, no. 6, pp. 1926-1951, 1995. 
[6] R. Kleinberg, "Nearly Tight Bounds for the Continuum-Armed Bandit Problem", in Advances in Neural Information Processing Systems (NIPS), pp. 697-704, 2004. 
[7] P. Auer, R. Ortner, C. Szepesvari, "Improved Rates for the Stochastic Continuum-Armed Bandit Problem", in Lecture Notes in Computer Science, vol. 4539, pp. 454-468, 2007. 
[8] E. W. Cope, "Regret and Convergence Bounds for a Class of Continuum-Armed Bandit Problem", in IEEE Transactions on Automatic Control, vol. 54, no. 6, June 2009. 
[9] R. Kleinberg, T. Leighton, "The value of knowing a demand curve: bounds on regret for online posted-price auctions", in Proc. 44th IEEE Symposium on Foundations of Computer Science (FOCS), pp. 594-605, 2003. 
[10] A. McLennan, "Price Dispersion and Incomplete Learning in the Long Run", in Journal of Economic Dynamics and Control, vol. 7, no. 3, pp. 331-347, September 1984. 
[11] J. Michael Harrison, N. Bora Keskin, and Assaf Zeevi, "Bayesian Dynamic Pricing Policies: Learning and Earning Under a Binary Prior Distribution", in Management Science, pp. 1-17, October 2011. 
[12] D. Easley, N. M. Kiefer, "Controlling a Stochastic Process with Unknown Parameters", in Econometrica, vol. 56, no. 5, pp. 1045-1064, 1988. 
[13] P. Aghion, P. Bolton, C. Harris, and B. Jullien, "Optimal Learning by Experimentation", in The Review of Economic Studies, vol. 58, no. 4, pp. 621-654, 1991. 
[14] Y. Aviv and A. Pazgal, "A Partially Observed Markov Decision Process for Dynamic Pricing", in Management Science, vol. 51, no. 9, pp. 1400-1416, September 2005. 
[15] V. F. Farias and B. Van Roy, "Dynamic Pricing with a Prior on Market Response", in Institute for Operations Research and the Management Sciences, vol. 58, no. 1, pp. 16-29, November 2009. 
[16] G. Keller and S. Rady, "Optimal Experimentation in a Changing Environment", in The Review of Economic Studies, vol. 66, no. 3, pp. 475-507, 1999. 
[17] Z. Bar-Yossef, K. Hildrum, and F. Wu, "Incentive-compatible online auctions for digital goods", in Proc. 13th Symp. on Discrete Alg., pp. 964-970, 2002. 
[18] A. Blum, V. Kumar, A. Rudra, and F. Wu, "Online learning in online auctions", in Proc. 14th Symp. on Discrete Alg., pp. 202-204, 2003. 
[19] A. Fiat, A. Goldberg, J. Hartline, and A. Wright, "Competitive generalized auctions", in Proc. 34th ACM Symposium on the Theory of Computing. ACM Press, New York, 2002. 
[20] A. Goldberg, J. Hartline, and A. Wright, "Competitive auctions and digital goods", in Proc. 12th Symp. on Discrete Alg., pp. 735-744, 2001. 
[21] O. Besbes and A. Zeevi, "Dynamic Pricing Without Knowing the Demand Function: Risk Bounds and Near-Optimal Algorithms", in Operations Research, vol. 57, no. 6, pp. 1407-1420, November-December 2009. 
[22] R. D. Smallwood and E. J. Sondik, "The Optimal Control of Partially Observable Markov Processes over a Finite Horizon", in Operations Research, pp. 1071-1088, 1971. 
[23] C. H. Papadimitriou and J. N. Tsitsiklis, "The complexity of Markov Decision Processes", in Mathematics of Operations Research, vol. 12, no. 3, pp. 441-450, August 1987. 
[24] H. Chernoff, "Measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations", in Ann. Math. Statist., vol. 23, pp. 493-507, 1952. 
[25] D. H. Johnson and S. Sinanovic, "Symmetrizing the Kullback-Leibler Distance", Unreviewed Manuscript available: http://www.ece.rice.edu/~dhj/resistor.pdf 
[26] S. Kullback and R. A. Leibler, "On information and sufficiency", in Ann. Math. Statist., vol. 22, pp. 79-86, 1951. 



